Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize strings transformation #371

Merged
merged 65 commits into from
May 26, 2021
Merged

Conversation

binh-dam-ibigroup
Copy link
Contributor

@binh-dam-ibigroup binh-dam-ibigroup commented Mar 22, 2021

Checklist

  • Appropriate branch selected (all PRs must first be merged to dev before they can be merged to master)
  • Any modified or new methods or classes have helpful JavaDoc and code is thoroughly commented
  • [na] The description lists all applicable issues this PR seeks to resolve
  • The description lists any configuration setting(s) that differ from the default settings
  • All tests and CI builds passing

Description

This PR adds functionality for normalizing strings in a GTFS table:

  • Capitalize strings, with exceptions
  • Substitute text/words, with optional space normalization around words, if substituting words.
    In addition:
  • ZipTransform and DbTransform are refactored to pass the correct FeedTransformTarget type to the implementers
  • Parameter validation is extracted from the transformation code itself.

Two optional configuration parameters are introduced:

  • DEFAULT_CAPITALIZATION_EXCEPTIONS, a comma-separated list of words as they should appear after capitalization (acronyms, cardinal points)
  • DEFAULT_SUBSTITUTIONS, a comma-separated list of text and its replacement, in the format old => new, or old =>+ new if space needs to be normalized around new.

Things that have yet to be resolved:

  • There is a "convert to sentence case" stub/option in NormalizeFieldTransformation, is it important?
  • Should we support for the GTFS table files inside a folder in the GTFS ZIP archive? e.g. gtfs.zip/google_transit/routes.txt

landonreed and others added 17 commits February 19, 2021 17:16
…' into normalize-strings-transformation

# Conflicts:
#	src/main/java/com/conveyal/datatools/manager/models/transform/NormalizeFieldTransformation.java
#	src/test/java/com/conveyal/datatools/manager/models/transform/NormalizeFieldTransformationTest.java
@codecov-io
Copy link

codecov-io commented Mar 31, 2021

Codecov Report

Merging #371 (e03367f) into dev (f0a3183) will increase coverage by 0.92%.
The diff coverage is 67.73%.

Impacted file tree graph

@@             Coverage Diff              @@
##                dev     #371      +/-   ##
============================================
+ Coverage     26.61%   27.53%   +0.92%     
- Complexity      649      688      +39     
============================================
  Files           178      180       +2     
  Lines          9188     9354     +166     
  Branches       1243     1260      +17     
============================================
+ Hits           2445     2576     +131     
- Misses         6504     6533      +29     
- Partials        239      245       +6     
Flag Coverage Δ Complexity Δ
unit_tests 27.53% <67.73%> (+0.92%) 688.00 <47.00> (+39.00)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
.../conveyal/datatools/manager/models/CustomFile.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ls/manager/models/transform/ZipTransformation.java 35.29% <31.25%> (-64.71%) 6.00 <5.00> (+5.00) ⬇️
...com/conveyal/datatools/manager/jobs/DeployJob.java 18.31% <47.25%> (+0.69%) 24.00 <14.00> (+8.00)
...ols/manager/models/transform/DbTransformation.java 53.12% <51.61%> (-46.88%) 4.00 <3.00> (+3.00) ⬇️
...ransform/ReplaceFileFromVersionTransformation.java 82.75% <88.88%> (+16.09%) 6.00 <3.00> (+2.00)
...models/transform/NormalizeFieldTransformation.java 90.00% <90.00%> (ø) 21.00 <21.00> (?)
.../datatools/manager/jobs/ArbitraryTransformJob.java 91.66% <100.00%> (ø) 3.00 <0.00> (ø)
...eyal/datatools/manager/jobs/OtpRunnerManifest.java 100.00% <100.00%> (ø) 1.00 <0.00> (ø)
.../conveyal/datatools/manager/models/Deployment.java 4.09% <100.00%> (+0.96%) 4.00 <0.00> (-1.00) ⬆️
.../models/transform/DeleteRecordsTransformation.java 74.19% <100.00%> (+12.79%) 5.00 <0.00> (-1.00) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0a3183...e03367f. Read the comment docs.

Copy link
Contributor

@evansiroky evansiroky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I approve overall of this PR. However, I think we may want to rethink whether we should have the transform rules be a list within the FeedSource object. I think it deserves its own mongo collection because it is probably unnecessary to fetch all transformation data whenever a feed source is fetched. Also, it might also make sense to combine the FeedTransformRules, FeedRetrievalMethod and FeedTransformation classes together into a single FeedTransformation class. These design decisions were made prior to this PR, but I think moving forward, we should seriously consider refactoring this and avoid making additional lists within model classes unless a very small amount of data is expected within these lists.

Copy link
Contributor

@evansiroky evansiroky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just seeing a failing build and my comment about a PR with a proposed refactor is now changing my mind to request changes.

…on-eas

Combine TypedFeedTransformation and FeedTransformation
@binh-dam-ibigroup
Copy link
Contributor Author

In general, I approve overall of this PR. However, I think we may want to rethink whether we should have the transform rules be a list within the FeedSource object. I think it deserves its own mongo collection because it is probably unnecessary to fetch all transformation data whenever a feed source is fetched. Also, it might also make sense to combine the FeedTransformRules, FeedRetrievalMethod and FeedTransformation classes together into a single FeedTransformation class. These design decisions were made prior to this PR, but I think moving forward, we should seriously consider refactoring this and avoid making additional lists within model classes unless a very small amount of data is expected within these lists.

I have created issue #380 with your comments above.

Copy link
Contributor

@br648 br648 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, but looks good.


@Override
public void validateParameters(MonitorableJob.Status status) {
// Does nothing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method needed then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is needed because validateParameters is now abstract in the parent class. Is @Override needed though?

@br648 br648 removed their assignment May 11, 2021
@evansiroky evansiroky removed their assignment May 13, 2021
@landonreed landonreed merged commit d80f10f into dev May 26, 2021
@landonreed landonreed deleted the normalize-strings-transformation branch May 26, 2021 20:16
@binh-dam-ibigroup binh-dam-ibigroup mentioned this pull request May 28, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants